Correction Approach to Word Segmentation

Authors

  • Ekawat Chaowicharat
  • Kanlaya Naruedomkul
Abstract

A number of word segmentation algorithms have been offered in the past; however, there is still room for improvement. Co-occurrence-Based Error Correction (CBEC), the proposed approach in this chapter, is a novel Thai word segmentation approach that was designed to provide accurate segmentation results based on context and purpose. CBEC quickly segments the input string using any available algorithm; maximal matching was used in the experiment. Next, CBEC checks its segmentation output against an error risk data bank to determine if there is any error risk. The error risk data bank is developed based on a training corpus. The current version of the error risk bank was based on the training corpus available at BEST 2009. Then, CBEC re-segments the input string using the co-occurrence score of the word sequence to ensure the accuracy of the segmentation result.

DOI: 10.4018/978-1-61350-447-5.ch023
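The three steps described in the abstract (initial segmentation, error risk lookup, re-segmentation by co-occurrence score) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the chapter's implementation: greedy longest matching stands in for maximal matching, the error risk bank is modelled as a hypothetical dictionary from a risky character span to alternative segmentations, the co-occurrence score is taken to be the sum of adjacent word-pair counts, and the toy English data is invented.

```python
def longest_matching(text, lexicon):
    """Greedy longest-match segmentation (a simplified stand-in for maximal matching)."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character as a token.
            words.append(text[i])
            i += 1
    return words


def cooccurrence_score(words, pair_counts):
    """Assumed score: sum of corpus counts of each adjacent word pair."""
    return sum(pair_counts.get(pair, 0) for pair in zip(words, words[1:]))


def cbec_segment(text, lexicon, risk_bank, pair_counts):
    """Segment, look up risky spans, and keep the best-scoring re-segmentation."""
    words = longest_matching(text, lexicon)
    best, best_score = words, cooccurrence_score(words, pair_counts)
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            span = "".join(words[i:j])
            # risk_bank is hypothetical: span -> list of alternative segmentations.
            for alternative in risk_bank.get(span, []):
                candidate = words[:i] + list(alternative) + words[j:]
                score = cooccurrence_score(candidate, pair_counts)
                if score > best_score:
                    best, best_score = candidate, score
    return best


# Invented toy data (English without spaces) for illustration only.
LEXICON = {"the", "theta", "table", "bled", "own", "down", "there"}
RISK_BANK = {"thetabledown": [("the", "table", "down")]}
PAIR_COUNTS = {("the", "table"): 5, ("table", "down"): 1, ("down", "there"): 3}

if __name__ == "__main__":
    print(longest_matching("thetabledownthere", LEXICON))
    print(cbec_segment("thetabledownthere", LEXICON, RISK_BANK, PAIR_COUNTS))
```

In this toy case the greedy pass yields "theta bled own there"; the risky span "thetabledown" is found in the assumed bank, and the alternative "the table down there" wins because its word pairs have higher co-occurrence counts. When no risky span is found, the initial segmentation is returned unchanged.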


Similar resources

Co-Occurrence-Based Error Correction Approach to Word Segmentation

To overcome the problems in Thai word segmentation, a number of word segmentation approaches have been proposed over the years. We propose a novel Thai word segmentation approach called Co-occurrence-Based Error Correction (CBEC). CBEC generates all possible segmentation candidates using the classical maximal matching algorithm and then selects the most accurate segmentation ...


Comparison of state-of-the-art atlas-based bone segmentation approaches from brain MR images for MR-only radiation planning and PET/MR attenuation correction

Introduction: Magnetic Resonance (MR) imaging has emerged as a valuable tool in radiation treatment (RT) planning as well as Positron Emission Tomography (PET) imaging owing to its superior soft-tissue contrast. Due to the fact that there is no direct transformation from voxel intensity in MR images into electron density, it's crucial to generate a pseudo-CT (Computed Tomography) image ...


Non-Deterministic Segmentation for Chinese Lattice Parsing

Parsing Chinese critically depends on correct word segmentation for the parser since incorrect segmentation inevitably causes incorrect parses. We investigate a pipeline approach to segmentation and parsing using word lattices as parser input. We compare CRF-based and lexicon-based approaches to word segmentation. Our results show that the lattice parser is capable of selecting the correction s...


Word segmentation in Persian continuous speech using F0 contour

Word segmentation in continuous speech is a complex cognitive process. Previous research on spoken word segmentation has revealed that in fixed-stress languages, listeners use acoustic cues to stress to de-segment speech into words. It has been further assumed that stress in non-final or non-initial position hinders the demarcative function of this prosodic factor. In Persian, stress is retract...


Handwritten ZIP code recognition using lexicon free word recognition algorithm

This paper describes a new approach to ZIP code recognition using a word recognition algorithm, where a numeral string is recognized as a word. This paper also describes an end-to-end ZIP code recognition system consisting of tilt/slant correction, line segmentation, word segmentation, ZIP code location, as well as the ZIP code recognition. Evaluation tests are performed using address block ima...




Publication date: 2016